fix(deploy): prevent concurrent project mutations from half-deleting deployments #144
Open
uittenbroekrobbert wants to merge 1 commit into
…deployments

Two sibling deployment deletes of the same project ran concurrently, collided on the shared project file during git push, hit an unrecoverable rebase conflict, and left the deployment half-deleted (toets-hn7/pr-36). The same bug class produced a durable orphan that nothing detected (toets-hn7/pr-32).

- Serialize project-file-mutating tasks per project in the claim guard, so deploys/updates/deletes of one project no longer race on projects/<p>.yaml and the per-project ArgoCD kustomization. Backups/restores stay per-deployment.
- Self-heal the project-file push: on an unmergeable rebase, reset to the current remote, re-apply the intended change, and retry instead of failing terminally (new typed GitPushConflictError + reapply hook, wired into the delete path).
- Add a read-only GET /api/v2/admin/deployments/drift report that flags deployments live in ArgoCD but absent from the project file.

Verified with a real Postgres (claim serialization), real git (push self-heal convergence), and against live production data (drift report flags pr-32 only).
Why
In production, a delete of deployment `pr-36` in project `toets-hn7` failed terminally: two sibling deployment deletes of the same project ran concurrently, collided on the shared project file during `git push`, hit an unrecoverable rebase conflict, and left the deployment half-deleted. Only a blind task retry recovered it. The same bug class produced a durable orphan that nothing detected: `toets-hn7/pr-32` was removed from the project file on 2026-06-12 but its ArgoCD Application and pods kept running for 12 days.

What
1a. Serialize project-file-mutating tasks per project.
The in-flight claim guard keyed on `project_name` and `deployment_name`, so sibling deployments of one project ran concurrently and raced on `projects/<project>.yaml` and the per-project ArgoCD `kustomization.yaml`. The guard now serializes any two project-file-mutating tasks of the same project (`PROJECT_FILE_MUTATING_TASK_TYPES`). Clone/backup/restore stay per-deployment so a slow restore never blocks deploys. Cross-project work still runs in parallel.

1b. Self-heal the project-file push on a true rebase conflict.
`GitConnector.push_changes` / `commit_and_push` accept an optional `reapply` callback. On an unmergeable rebase it hard-resets to the current remote, re-applies the intended change on fresh content, and retries, instead of failing terminally with a bare `RuntimeError`. A new typed `GitPushConflictError` is raised when no `reapply` is given. The deployment-delete path passes a `reapply` that re-reads the project file fresh and re-removes the deployment. This also covers the auto-tuner, which commits to project files outside the task system.

2a. Read-only deployment drift report.
`GET /api/v2/admin/deployments/drift` compares deployments declared in project files against live ArgoCD Applications and reports `orphaned_deployments` (live but undeclared, the pr-32 case) and `missing_deployments`. Zero mutations. Project-infrastructure apps (`{project}-infrastructure`) are excluded.

No behaviour change for single, non-concurrent operations.
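For reviewers, the claim rule in 1a can be sketched in memory. This is an illustration only, not the guard's actual SQL: the `Task` dataclass, the `conflicts` and `can_claim` helpers, and the task-type strings are hypothetical, and the sketch assumes the pre-existing same-deployment rule stays in force.

```python
from dataclasses import dataclass

# Hypothetical task-type names; the PR names only the set itself.
PROJECT_FILE_MUTATING_TASK_TYPES = frozenset({"deploy", "update", "delete"})


@dataclass(frozen=True)
class Task:
    task_type: str
    project_name: str
    deployment_name: str


def conflicts(running: Task, candidate: Task) -> bool:
    """Return True if `candidate` must wait while `running` is in flight."""
    if running.project_name != candidate.project_name:
        return False  # cross-project work runs in parallel
    if running.deployment_name == candidate.deployment_name:
        return True  # assumed pre-existing rule: same project and deployment
    # New rule: siblings serialize only when both mutate the project file.
    return (
        running.task_type in PROJECT_FILE_MUTATING_TASK_TYPES
        and candidate.task_type in PROJECT_FILE_MUTATING_TASK_TYPES
    )


def can_claim(candidate: Task, in_flight: list[Task]) -> bool:
    """A task may be claimed only if it conflicts with nothing in flight."""
    return not any(conflicts(running, candidate) for running in in_flight)
```

With a delete of `toets-hn7/pr-36` in flight, a sibling delete of `pr-37` would wait, while a backup of `pr-37` or a delete in another project could still be claimed.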
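The comparison behind the drift report in 2a amounts to two set differences. The sketch below assumes deployments are available as `(project, deployment)` pairs and that infrastructure apps show up with the deployment part `infrastructure`; the `drift_report` function and its input shapes are hypothetical, and only the two result keys come from this PR.

```python
# Assumed marker for project-infrastructure apps ({project}-infrastructure).
INFRA_NAME = "infrastructure"


def drift_report(
    declared: set[tuple[str, str]],
    live: set[tuple[str, str]],
) -> dict[str, list[str]]:
    """Compare declared deployments with live ones. Read-only, no mutations."""
    # Infrastructure apps are never declared as deployments, so drop them first.
    live_deployments = {(p, d) for p, d in live if d != INFRA_NAME}
    orphaned = live_deployments - declared  # live but undeclared
    missing = declared - live_deployments   # declared but not live
    return {
        "orphaned_deployments": sorted(f"{p}/{d}" for p, d in orphaned),
        "missing_deployments": sorted(f"{p}/{d}" for p, d in missing),
    }
```

Applied to a project where `pr-32` is live but no longer declared, this would list `toets-hn7/pr-32` under `orphaned_deployments` and nothing under `missing_deployments`.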
Testing
- Real Postgres (`test_async_task_claim_serialization_db.py`, `requires_infra`): proves the actual claim SQL on a live DB. The pr-36/pr-37 scenario claims only one; backups are not over-serialized; cross-project deletes both claim.
- Real git (`test_git_push_conflict_integration.py`): forces an actual rebase conflict against a local bare remote and proves the reapply path converges while preserving the concurrent writer's change.
- Live production data: the drift report flags only `toets-hn7/pr-32`, zero false positives, zero missing.

Follow-up (not in this PR)
- … `toets-hn7/pr-32` orphan (runbook in `features/deployment-delete-race-prevention.md`) and clear any stale `marked_for_deletion` row for `toets-hn7/pr-36`.
- … `reapply` hook into the upsert/update-image/auto-tune writers for full coverage.
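For anyone picking up the `reapply` follow-up, the contract from 1b can be sketched against an in-memory stand-in for the remote. Everything here is illustrative: `FakeRemote` and `push_with_reapply` are hypothetical names, not the `GitConnector` API, and the fake treats any stale base as a conflict, whereas real git only fails when the rebase cannot merge.

```python
from typing import Callable, Optional, Set


class GitPushConflictError(RuntimeError):
    """Stand-in for the typed error named in this PR."""


class FakeRemote:
    """In-memory stand-in for a remote holding one project's deployments."""

    def __init__(self, deployments: Set[str]) -> None:
        self.deployments = set(deployments)
        self.version = 0

    def fetch(self):
        return set(self.deployments), self.version

    def push(self, deployments: Set[str], base_version: int) -> None:
        if base_version != self.version:
            raise GitPushConflictError("remote moved since fetch")
        self.deployments = set(deployments)
        self.version += 1


def push_with_reapply(
    remote: FakeRemote,
    content: Set[str],
    base_version: int,
    reapply: Optional[Callable[[Set[str]], Set[str]]] = None,
    max_attempts: int = 3,
) -> None:
    """Push; on conflict, reset to the remote, re-apply the change, retry."""
    for _ in range(max_attempts):
        try:
            remote.push(content, base_version)
            return
        except GitPushConflictError:
            if reapply is None:
                raise  # no hook given: surface the typed error
            fresh, base_version = remote.fetch()  # plays the hard reset
            content = reapply(fresh)  # redo the change on fresh content
    raise GitPushConflictError("still conflicting after retries")
```

A writer adopting the hook would pass a `reapply` that recomputes its change from the fresh content, for example `lambda fresh: fresh - {"pr-36"}` for a delete, so a concurrent writer's change is preserved rather than overwritten.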